This is the code for my submission in Kaggle’s “House Prices: Advanced Regression Techniques” competition. This competition was quite a bit of fun because of the numerous ways I could clean the data, engineer new features, and choose how to build my model. The goal for this competition was to minimize RMSLE when predicting the selling price of a house. If you would like to learn more about this competition, visit https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The sections for my analysis are:

- Inspecting the Data
- Cleaning the Data (only visible in the .Rmd file)
- Feature Engineering (only visible in the .Rmd file)
- Building the Model
- Summary
The dimensions of the train dataset:
## [1] 1460 81
Not the biggest dataset, but that’s alright.
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
We have quite a bit of missing data here. Let’s count the missing values in each feature to see how much.
## PoolQC MiscFeature Alley Fence FireplaceQu
## 1453 1406 1369 1179 690
## LotFrontage GarageType GarageYrBlt GarageFinish GarageQual
## 259 81 81 81 81
## GarageCond BsmtExposure BsmtFinType2 BsmtQual BsmtCond
## 81 38 38 37 37
## BsmtFinType1 MasVnrType MasVnrArea Electrical Id
## 37 8 8 1 0
## MSSubClass MSZoning LotArea Street LotShape
## 0 0 0 0 0
## LandContour Utilities LotConfig LandSlope Neighborhood
## 0 0 0 0 0
## Condition1 Condition2 BldgType HouseStyle OverallQual
## 0 0 0 0 0
## OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 0 0 0 0 0
## Exterior1st Exterior2nd ExterQual ExterCond Foundation
## 0 0 0 0 0
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 0 0 0 0 0
## HeatingQC CentralAir X1stFlrSF X2ndFlrSF LowQualFinSF
## 0 0 0 0 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 0 0 0 0 0
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 0 0 0 0 0
## Fireplaces GarageCars GarageArea PavedDrive WoodDeckSF
## 0 0 0 0 0
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## 0 0 0 0 0
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
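A table like the one above can be produced with a couple of base R calls. This is a minimal sketch on a made-up three-column data frame; the name `train` and the toy values are assumptions for illustration, not the original chunk from the .Rmd file.

```r
# Toy stand-in for the training data (the real dataset has 81 columns).
train <- data.frame(
  PoolQC      = c(NA, NA, "Gd", NA),
  LotFrontage = c(65, NA, 68, 60),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
# Sorting in decreasing order puts the features with the most NAs first.
na_counts <- sort(colSums(is.na(train)), decreasing = TRUE)
print(na_counts)
```

Sorting the counts makes it easy to separate features that are mostly empty (such as PoolQC) from those with only a handful of gaps.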
Ok, so we have quite a few missing values in some features. Let’s fix that.
**Note:** This is where I cleaned the features. If you would like to see how I performed this step, please view the .Rmd file.
What are the most important numerical features, based on correlation?
Let’s take a look at some features that are highly correlated with selling price.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
The distribution of GrLivArea has a bit of a long right tail, and there are two outliers with more than 4,500 square feet of living area but a sale price below $200,000. Let’s see how the plot changes without those points.
That looks better, but before we remove those outliers, let’s take a look at the correlation. First with the two datapoints, then without.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$SalePrice and CleanedTrain$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6821200 0.7332695
## sample estimates:
## cor
## 0.7086245
##
## Pearson's product-moment correlation
##
## data: a$SalePrice and a$GrLivArea
## t = 41.358, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7104365 0.7577160
## sample estimates:
## cor
## 0.7349682
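The with/without comparison above can be sketched as follows. The data here is synthetic (the data frame `toy` and all of its values are made up), and the filter is an assumption based on the description of the two outliers; only `cor.test()` and `subset()` are the real base R calls.

```r
# Synthetic stand-in for CleanedTrain; none of these values are real.
set.seed(1)
n <- 200
toy <- data.frame(GrLivArea = round(runif(n, 600, 3000)))
toy$SalePrice <- 100 * toy$GrLivArea + rnorm(n, sd = 30000)

# Two artificial outliers: very large houses that sold cheaply.
toy <- rbind(toy, data.frame(GrLivArea = c(4700, 5600),
                             SalePrice = c(185000, 160000)))

# Assumed filter: drop houses over 4500 sq ft that sold for under 200,000.
a <- subset(toy, !(GrLivArea > 4500 & SalePrice < 200000))

with_outliers    <- cor.test(toy$SalePrice, toy$GrLivArea)
without_outliers <- cor.test(a$SalePrice, a$GrLivArea)

# Compare the two Pearson estimates side by side.
print(c(with    = unname(with_outliers$estimate),
        without = unname(without_outliers$estimate)))
```

Filtering on both conditions at once keeps legitimately large, expensive houses in the data and removes only the points that break the trend.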
That should help improve the results. Let’s remove those two points.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$SalePrice and CleanedTrain$OverallQual
## t = 50.141, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7761447 0.8138630
## sample estimates:
## cor
## 0.7957743
##
## 1 2 3 4 5 6 7 8 9 10
## 2 3 20 116 397 374 319 168 43 16
Everything looks good with OverallQual.
Although OverallCond is not highly correlated with SalePrice, I want to have a closer look, because I thought it would have similar values to OverallQual.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$SalePrice and CleanedTrain$OverallCond
## t = -2.9834, df = 1456, p-value = 0.002898
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12877065 -0.02671789
## sample estimates:
## cor
## -0.07794846
##
## 1 2 3 4 5 6 7 8 9 10
## 2 3 20 116 397 374 319 168 43 16
I don’t notice anything worrying/wrong with the data. It looks like the huge range of selling prices with OverallCond of 5 might have ruined any chance of a strong correlation.
##
## 0 1 2 3
## 9 650 767 32
This looks good, but now I’m going to focus on continuous variables to see if we can find any more outliers.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$X1stFlrSF and CleanedTrain$SalePrice
## t = 31.08, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5996337 0.6614237
## sample estimates:
## cor
## 0.6315304
Everything looks fine with first floor square footage.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$TotalBsmtSF and CleanedTrain$SalePrice
## t = 32.738, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6205598 0.6797668
## sample estimates:
## cor
## 0.6511529
Everything looks good here.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$YearBuilt and CleanedTrain$SalePrice
## t = 23.451, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4853166 0.5598955
## sample estimates:
## cor
## 0.5236084
It’s interesting to see the housing booms and busts (~1960 and ~2013), plus everything looks fine.
There are some big outliers here. Below you’ll see the end result after I experimented with a range of cutoffs for lot area, from 15,000 up to the maximum value; 25,000 seemed to be the best limit. I also compared the correlations between SalePrice and other features with, and without, the outliers. Removing the outliers gives better or equal correlations, so we’ll go ahead and remove those datapoints.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$SalePrice and CleanedTrain$LotArea
## t = 10.622, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2198689 0.3151775
## sample estimates:
## cor
## 0.2681793
##
## Pearson's product-moment correlation
##
## data: a$SalePrice and a$LotArea
## t = 17.522, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3772598 0.4626693
## sample estimates:
## cor
## 0.4208969
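One way to run the kind of cutoff experiment described for LotArea is to loop over candidate limits and record the correlation for each subset. This is only a sketch on synthetic data: the step size of 5,000, the data frame `toy`, and its values are assumptions, and the original experiment may have compared other features as well.

```r
# Synthetic stand-in for the training data; values are made up.
set.seed(2)
n <- 300
toy <- data.frame(LotArea = round(runif(n, 2000, 20000)))
toy$SalePrice <- 60000 + 10 * toy$LotArea + rnorm(n, sd = 40000)

# A handful of very large lots whose prices do not scale with area.
big <- data.frame(LotArea   = c(50000, 70000, 115000, 160000, 215000),
                  SalePrice = c(280000, 200000, 300000, 270000, 370000))
toy <- rbind(toy, big)

# Candidate upper limits, ending at the maximum (i.e. no rows removed).
limits <- c(seq(15000, 50000, by = 5000), max(toy$LotArea))

# Correlation between price and lot area for each candidate limit.
cors <- sapply(limits, function(limit) {
  kept <- toy[toy$LotArea <= limit, ]
  cor(kept$SalePrice, kept$LotArea)
})

n_kept <- sapply(limits, function(limit) sum(toy$LotArea <= limit))
print(data.frame(limit = limits, n_kept = n_kept, cor = round(cors, 3)))
```

Reporting `n_kept` next to each correlation shows the trade-off directly: a tighter limit may raise the correlation, but it also discards more rows.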
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$SalePrice and CleanedTrain$LotFrontage
## t = 6.9388, df = 1426, p-value = 5.982e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1300688 0.2304373
## sample estimates:
## cor
## 0.1807235
Although this is a significant outlier, my model performs better with this datapoint, so I will leave it in.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$SalePrice and CleanedTrain$X2ndFlrSF
## t = 12.283, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2616409 0.3554870
## sample estimates:
## cor
## 0.3093169
Many houses do not have second floors. The data looks fine.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$SalePrice and CleanedTrain$WoodDeckSF
## t = 12.041, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2559486 0.3501455
## sample estimates:
## cor
## 0.3037892
Many houses also do not have wood decks, but again, the data looks fine.
##
## Blmngtn Blueste BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert
## 17 2 16 58 18 150 50 98 78
## IDOTRR MeadowV Mitchel NAmes NoRidge NPkVill NridgHt NWAmes OldTown
## 37 17 45 222 38 9 77 73 113
## Sawyer SawyerW Somerst StoneBr SWISU Timber Veenker
## 73 59 86 24 25 33 10
There are definitely some more expensive neighborhoods, such as NoRidge and NridgHt. We’ll use this information for the feature engineering section.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$GarageArea and CleanedTrain$SalePrice
## t = 31.304, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6063996 0.6679590
## sample estimates:
## cor
## 0.6381983
Hmm, let’s see what happens if we remove values greater than 1248.
##
## Pearson's product-moment correlation
##
## data: a$GarageArea and a$SalePrice
## t = 31.983, df = 1424, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6152984 0.6757865
## sample estimates:
## cor
## 0.6465575
It’s only a slight improvement, so we’ll keep the datapoints to have more information for building our model.
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$MonthYearSold and CleanedTrain$SalePrice
## t = -0.7478, df = 1426, p-value = 0.4547
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07159945 0.03210821
## sample estimates:
## cor
## -0.01979888
Given that the sales in this dataset took place during the recession, I was wondering if the selling prices would drop during 2008-2010…they didn’t.
##
## 20 30 40 45 50 60 70 75 80 85 90 120 160 180 190
## 521 68 4 12 142 290 60 15 58 20 51 87 63 10 27
It’s a little tough to see any strong insights here. 2-STORY 1946 & NEWER (#60) houses are generally worth the most, but so are 1-STORY PUD (Planned Unit Development, #120) - 1946 & NEWER. Perhaps the number of stories doesn’t matter as much as when the house was made.
##
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
## 151 14 707 8 11 435 37 65
Just as a reminder, the correlation between sale price and year built was roughly 0.52 in the test shown earlier, so I am confident in saying that year built has a stronger relationship with sale price than house style / number of stories.
The earliest value for YearRemodAdd:
## [1] 1950
The number of houses with this minimum value:
##
## FALSE TRUE
## 1251 177
##
## Pearson's product-moment correlation
##
## data: CleanedTrain$SalePrice and CleanedTrain$YearRemodAdd
## t = 22.922, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4799375 0.5558064
## sample estimates:
## cor
## 0.518893
There are way too many houses that have their value for YearRemodAdd as 1950. I am going to assume that this is the earliest date that could be recorded for this feature, which has led to the error. For houses with YearRemodAdd = 1950, I am going to change their value to the year the house was built plus the average difference between when a house was built and when it was remodelled. Here’s an example to make this clearer: YearBuilt = 1930, average difference between year built and remodelled = 4.38, new value for YearRemodAdd = 1934.38.
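The replacement rule can be sketched in a few lines of base R. The data frame `toy` is made up, and computing the average gap only from rows that are not at the 1950 floor is an assumption of this sketch; the original code in the .Rmd file may compute the average differently.

```r
# Toy data; the first two rows sit at the suspicious 1950 floor.
toy <- data.frame(
  YearBuilt    = c(1930, 1920, 1960, 1990, 2000),
  YearRemodAdd = c(1950, 1950, 1965, 1990, 2004)
)

floor_year <- min(toy$YearRemodAdd)           # 1950 here, as in the real data
is_floor   <- toy$YearRemodAdd == floor_year  # rows to be replaced

# Assumption: average build-to-remodel gap from the remaining rows only.
avg_gap <- mean(toy$YearRemodAdd[!is_floor] - toy$YearBuilt[!is_floor])

# New value = year built + average gap (may be fractional, e.g. 1934.38).
toy$YearRemodAdd[is_floor] <- toy$YearBuilt[is_floor] + avg_gap
print(toy)
```

In this toy example the average gap is 3 years, so the two floored rows become 1933 and 1923.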
##
## Pearson's product-moment correlation
##
## data: a$SalePrice and a$YearRemodAdd
## t = 22.023, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4640384 0.5415082
## sample estimates:
## cor
## 0.5037856
Although the correlation went down, I believe these new values more accurately represent the real world.
Before we move on to feature engineering, it would definitely be worth having a look at the sale prices.
I’ve been seeing the two data points with a sale price of over $700,000 in many graphs. After looking at the quality of my model, I’ve decided to also remove the datapoints with a sale price greater than $600,000.
Now let’s bring our train and test datasets back together to do some feature engineering. **Note:** just like with cleaning the data, if you want to see the steps I took, please use the .Rmd file.
Now it’s time to train the model.
## + Fold1: lambda=5e-06, penalty=MCP
## - Fold1: lambda=5e-06, penalty=MCP
## + Fold2: lambda=5e-06, penalty=MCP
## - Fold2: lambda=5e-06, penalty=MCP
## + Fold3: lambda=5e-06, penalty=MCP
## - Fold3: lambda=5e-06, penalty=MCP
## Aggregating results
## Fitting final model on full training set
## + Fold1: lambda=0.002
## - Fold1: lambda=0.002
## + Fold2: lambda=0.002
## - Fold2: lambda=0.002
## + Fold3: lambda=0.002
## - Fold3: lambda=0.002
## Aggregating results
## Fitting final model on full training set
## + Fold1: C=0.6
## - Fold1: C=0.6
## + Fold2: C=0.6
## - Fold2: C=0.6
## + Fold3: C=0.6
## - Fold3: C=0.6
## Aggregating results
## Fitting final model on full training set
These algorithms were chosen after spot checks of their initial performance; their parameters were then tuned.
##
## Call:
## summary.resamples(object = results)
##
## Models: rqnc, rqlasso, svmLinear
## Number of resamples: 3
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rqnc 19440 20010 20580 20400 20870 21170 0
## rqlasso 17330 18780 20230 19500 20580 20920 0
## svmLinear 20190 20280 20360 21920 22790 25210 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rqnc 0.9158 0.9167 0.9176 0.9208 0.9233 0.9290 0
## rqlasso 0.9225 0.9245 0.9266 0.9273 0.9298 0.9329 0
## svmLinear 0.8847 0.8985 0.9124 0.9088 0.9208 0.9292 0
## rqnc rqlasso svmLinear
## rqnc 1.0000000 0.8642528 0.7831973
## rqlasso 0.8642528 1.0000000 0.3640926
## svmLinear 0.7831973 0.3640926 1.0000000
Ensemble the models together.
## parameter RMSE Rsquared RMSESD RsquaredSD
## 1 none 19363.02 0.925703 1574.123 0.008220938
## The following models were ensembled: rqnc, rqlasso, svmLinear
## They were weighted:
## -2781.4167 -0.0264 0.7926 0.2473
## The resulting RMSE is: 19363.0215
## The fit for each individual model on the RMSE is:
## method RMSE RMSESD
## rqnc 20397.67 876.922
## rqlasso 19496.63 1904.032
## svmLinear 21919.67 2848.452
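The four weights printed above look like an intercept followed by one coefficient per model, which is what a linear blend of the base models' predictions would produce. As a rough illustration of that idea (not the ensembling code used here), the sketch below fits such a blend with `lm()` on made-up predictions; the data frame `preds`, the noise levels, and the helper `rmse()` are all hypothetical.

```r
# Hypothetical predictions from three base models on synthetic prices.
set.seed(3)
n <- 150
actual <- runif(n, 80000, 400000)
preds <- data.frame(
  rqnc      = actual + rnorm(n, sd = 20000),
  rqlasso   = actual + rnorm(n, sd = 19000),
  svmLinear = actual + rnorm(n, sd = 22000)
)

# Linear blend: an intercept plus one weight per base model.
stack <- cbind(preds, actual = actual)
blend <- lm(actual ~ rqnc + rqlasso + svmLinear, data = stack)
weights <- coef(blend)
print(round(weights, 4))

# In-sample RMSE of each base model versus the blend.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
ensemble_pred <- predict(blend, newdata = stack)
print(c(sapply(preds, rmse, obs = actual),
        ensemble = rmse(ensemble_pred, actual)))
```

Blending helps most when the base models make different errors, which is why the low correlation between rqlasso and svmLinear in the matrix above is encouraging.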
Summary of input algorithms, then the ensembled model:
## rqnc rqlasso svmLinear
## Min. : 51205 Min. : 53439 Min. : 45210
## 1st Qu.:127040 1st Qu.:126563 1st Qu.:126814
## Median :156419 Median :161618 Median :158451
## Mean :178061 Mean :178661 Mean :178532
## 3rd Qu.:207094 3rd Qu.:207597 3rd Qu.:208246
## Max. :463001 Max. :467352 Max. :472366
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49810 125700 161100 178300 207400 472200
## [1] "RMSLE of the testing values"
## [1] 0.1276157
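For reference, RMSLE can be computed with a small helper like the one below. This is a sketch: the function name `rmsle` is hypothetical, and it uses the common `log(x + 1)` form of the metric, which is an assumption about how the score above was calculated.

```r
# Root mean squared logarithmic error, using log(1 + x) via log1p().
rmsle <- function(pred, actual) {
  stopifnot(length(pred) == length(actual),
            all(pred > -1), all(actual > -1))
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}

# Made-up example: two predictions against two true sale prices.
print(rmsle(pred = c(210000, 140000), actual = c(200000, 150000)))
```

Because the errors are taken on the log scale, a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.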
First ten predicted housing prices of the dataset to be submitted for competition.
## Id SalePrice
## 1461 1461 118700.9
## 1462 1462 164559.4
## 1463 1463 190506.6
## 1464 1464 201567.9
## 1465 1465 195542.3
## 1466 1466 169269.3
## 1467 1467 182911.3
## 1468 1468 160134.2
## 1469 1469 193482.5
## 1470 1470 123739.8
Although this was not the largest dataset, doing well in this competition still came with some challenges and required some clever thinking. I believe that I did a good job cleaning the data, creating new features, and building my ensemble model, because I currently rank in the top 16% of submissions.
Below you can see a summary of my predicted selling prices and the ten most important features. As expected, square footage and quality play a large role in the selling price of a house. It was neat to see the importance of feature engineering, especially creating artificial features (such as SFxQual), in order to build an accurate model.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33410 125800 158300 178100 211800 934900
## overall rqnc rqlasso svmLinear
## SFxQual 3.519115 3.519115 3.519115 3.519115
## TotalSF 3.011293 3.011293 3.011293 3.011293
## HouseandGarageSF 3.002363 3.002363 3.002363 3.002363
## TotalInsideSF 2.805336 2.805336 2.805336 2.805336
## InsideQuality 2.800064 2.800064 2.800064 2.800064
## TotalQuality 2.777993 2.777993 2.777993 2.777993
## OverallQual 2.654390 2.654390 2.654390 2.654390
## BsmtSFxQual 2.433532 2.433532 2.433532 2.433532
## GrLivArea 2.078718 2.078718 2.078718 2.078718
## ExterQual 1.999226 1.999226 1.999226 1.999226
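To give a feel for what an artificial feature such as SFxQual might look like, here is a hypothetical construction using the first three rows of the training data shown earlier. The definitions below (total square footage as basement plus both floors, then multiplied by OverallQual) are assumptions for illustration only; the actual definitions are in the .Rmd file.

```r
# First three houses from the str() output near the top of this report.
houses <- data.frame(
  TotalBsmtSF = c(856, 1262, 920),
  X1stFlrSF   = c(856, 1262, 920),
  X2ndFlrSF   = c(854, 0, 866),
  OverallQual = c(7, 6, 7)
)

# Assumed definition: total area across basement, first and second floor.
houses$TotalSF <- houses$TotalBsmtSF + houses$X1stFlrSF + houses$X2ndFlrSF

# Assumed definition: interaction between size and overall quality.
houses$SFxQual <- houses$TotalSF * houses$OverallQual
print(houses[, c("TotalSF", "OverallQual", "SFxQual")])
```

An interaction like this lets a linear model express that an extra square foot is worth more in a high-quality house than in a low-quality one.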